
fix: close stale relay connection on ErrConnAlreadyExists to recover …#5866

Open
fpenezic wants to merge 6 commits into netbirdio:main from fpenezic:fix/stale-relay-conn-recovery

Conversation

@fpenezic fpenezic commented Apr 12, 2026

Describe your changes

Fix peer reconnection loops after a network event (PPPoE reconnect, NAT/conntrack flush, IP rotation) leaves transport state stale on either the relay or ICE path. Both manifest as the WireGuard tunnel going silent while the client UI reports Connected, with the connection never recovering on its own.

This PR addresses two distinct but related failure modes observed in production on a peer behind PPPoE NAT (Raspberry Pi) talking to a peer with a public IP (Oracle Cloud).

Problem 1 - Relay path: stale ErrConnAlreadyExists reuse

When the peer is running over relay and a network event invalidates the existing relay session, WorkerRelay.OnNewOffer calls relayManager.OpenConn which returns ErrConnAlreadyExists because the relay client still holds a map entry for the peer key. The previous behavior was to silently bail out and reuse that entry - which is dead.

Fix (commits 1–4, 6):

  • On ErrConnAlreadyExists, close the existing relay conn and reopen it.
  • Route CloseConnByPeerKey to the correct relay client (home vs. foreign server).
  • Take rt.RLock() when reading RelayTrack.relayClient to avoid a race with openConnVia() which initializes that field under rt.Lock().
  • Gate the close-and-retry behind a relayConnStale atomic.Bool flag so we only tear down when something has signaled that the entry is no longer backed by a live peer session. Without this gate, rapid successive offers from the remote peer cause an infinite tear-down/rebuild loop (observed: 3608 cycles in ~1 hour, ~9 Mbit/s of constant traffic).
  • Signal sources for the stale flag: conn.onWGDisconnected (Relay path), WorkerRelay.CloseConn, WorkerRelay.onRelayClientDisconnected.

Problem 2 - ICE path: session ID not rotated on WG timeout

When WireGuard handshake times out while running over ICE, onWGDisconnected calls workerICE.Close(). Close() sets w.agent = nil synchronously before pion's ICE library fires the asynchronous ConnectionStateClosed event. By the time onConnectionStateChange runs closeAgent(), the w.agent == agent guard fails (w.agent is already nil) and the session ID is not rotated.

Without rotation, the next ICE offer carries the same local session ID. The remote peer in OnNewOffer compares remoteSessionID against the incoming offer's SessionID, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state.

Observed in production: 30s reconnect loop with the same offer session ID logged 70+ times in a row, "ICE connection succeeded" on every cycle, WG handshake never recovering. UI shows Connected, P2P while last successful WG handshake is 40+ minutes old.

Fix (commit 5): Rotate the session ID explicitly in onWGDisconnected for the ICE case (mirroring the existing Relay-path behavior we added in commit 1) so the remote peer always recreates its ICE agent after a WG timeout on ICE.

Test plan

  • Local build passes (go build ./...)
  • Deployed on two production peers (RPi behind PPPoE NAT, OCI public IP) for live testing
  • Verified stable P2P connection with WG handshake refreshing on the normal 3-minute cadence (no regression)
  • Verified no infinite tear-down loop on relay-active peer (the bug introduced by the v1 fix and resolved in v2)
  • Validated against multiple real PPPoE reconnects with public IP rotation: tunnel recovered automatically in a single ICE → Relay → ICE transition (~4 seconds), no loop, WG handshake refreshed normally on the 3-minute cadence afterwards. Stable for hours post-event.

Caveat: unrelated Rosenpass interaction

During testing we hit a separate failure mode where Rosenpass (post-quantum PSK) was enabled in permissive mode on both peers. After an IP change, the WG tunnel could not recover because the local PSK state diverged from the remote peer's PSK (one side had a Rosenpass-managed PSK, the other had none). This produced the same external symptom (silent tunnel, UI says Connected) but is not addressed by this PR - the fixes here only cover the relay/ICE transport state. With Rosenpass disabled, both fixes recovered the tunnel cleanly across multiple PPPoE cycles. Filing the Rosenpass interaction as a separate issue.

Notes

The two problems share the same trigger (network event invalidates per-peer state without invalidating the local control plane) but live in different transports, so the fixes are orthogonal. Each is necessary on its own:

  • Without the relay fix, peers stuck on relay keep looping through the relay close/reopen cycle.
  • Without the ICE fix, peers that fall back to relay then re-establish ICE get pinned to a stale ICE agent on the remote side.

Both fixes are strictly defensive - they can only fire on ErrConnAlreadyExists (relay) or after a WG handshake timeout (ICE), so there's no behavior change on the happy path.

Issue ticket number and link

N/A - bug discovered in production, no pre-existing issue.

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

This is an internal recovery-path fix in the peer connection state machine. No public API, CLI flag, config field, or user-visible behavior changes - the fixes are strictly defensive and only fire on specific error conditions (ErrConnAlreadyExists for relay, WG handshake timeout for ICE). Nothing to document for end users or operators.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved connection timeout handling to better recover from WireGuard handshake and ICE/TURN failures by resetting session negotiation and retrying stale relay connections.
    • Enhanced peer connection stability through more robust relay connection cleanup and retry logic during connection collisions.


CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.


coderabbitai Bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 656c844c-5e27-454a-b21e-9383c92984a4

📥 Commits

Reviewing files that changed from the base of the PR and between b0b9b85 and 8c4ba8b.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go
✅ Files skipped from review due to trivial changes (1)
  • client/internal/peer/conn.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • client/internal/peer/worker_ice.go
  • shared/relay/client/manager.go
  • shared/relay/client/client.go
  • client/internal/peer/worker_relay.go

📝 Walkthrough

Walkthrough

On WireGuard relay handshake timeout the relay worker entry is marked stale before closing the relay connection and running relay-disconnect handling; after relay cleanup the ICE worker (if present) is forced to use a new session ID. For ICE/TURN-priority timeouts the ICE worker rotates its local session ID before closing the ICE worker. New relay client/manager APIs allow closing a relay connection by peer key; WorkerRelay tracks stale entries and will evict-and-retry on duplicate-open collisions.

Changes

Conn + ICE

  • Control / Callsite (client/internal/peer/conn.go): On WireGuard relay disconnects, call workerRelay.MarkStale() before closing the relay connection and handling the disconnect; for relay-priority timeouts, call workerICE.ResetSessionID() after relay cleanup. For ICE/TURN timeouts, call workerICE.ResetSessionID() before workerICE.Close().
  • Core Implementation (client/internal/peer/worker_ice.go): Added WorkerICE.ResetSessionID(), which locks muxAgent, generates a new sessionID via NewICESessionID(), and clears remoteSessionID on success; logs and returns on ID generation failure.

Relay worker

  • State (client/internal/peer/worker_relay.go): Add relayConnStale (atomic.Bool) to track stale relay conn entries.
  • Offer Handling (client/internal/peer/worker_relay.go): On relayClient.ErrConnAlreadyExists, either reuse the existing conn when not stale, or call CloseConnByPeerKey to evict the stale entry and retry OpenConn once; clear the stale flag on a successful retry.
  • Lifecycle / API (client/internal/peer/worker_relay.go): Set relayConnStale to true in CloseConn() and onRelayClientDisconnected(); add exported MarkStale() to mark the entry stale externally.

Relay client & manager

  • Client API (shared/relay/client/client.go): Added Client.CloseConnByPeerKey(peerKey string) to evict/close a connection by hashed peerKey: removes the mapping from conns, unsubscribes peer-state notifications, and closes the connContainer.
  • Manager Routing (shared/relay/client/manager.go): Added Manager.CloseConnByPeerKey(serverAddress, peerKey string), which prefers closing on the home client when ServerInstanceURL() matches, otherwise looks up the foreign relay client track for serverAddress and calls its CloseConnByPeerKey; silently no-ops when clients/tracks are missing.

Sequence Diagram(s)

sequenceDiagram
  participant Conn
  participant WorkerRelay
  participant RelayClient
  participant WorkerICE

  Conn->>WorkerRelay: onWGDisconnected (priority==Relay) / MarkStale()
  WorkerRelay->>RelayClient: CloseConnByPeerKey(peerKey)
  RelayClient-->>WorkerRelay: confirm closed
  WorkerRelay->>WorkerICE: ResetSessionID() (if exists)
  WorkerICE-->>WorkerRelay: new sessionID, remoteSessionID cleared
  WorkerRelay->>Conn: continue disconnect handling / retry OpenConn if needed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • pappz
  • mlsmaycon

Poem

🐰 I twitch my whiskers when sessions stale,
I nudge old relays down the trail,
I spin a new ICE name with glee,
Evict the stale, retry the spree,
Fresh hops ahead — hop, hop, hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title 'fix: close stale relay connection on ErrConnAlreadyExists to recover...' clearly and specifically describes the main relay-path fix and the primary problem being addressed.
  • Description check ✅ Passed: The PR description covers the required template sections: detailed problem explanation, specific fixes with commit references, test plan validation, bug-fix checklist, and documentation justification.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dcc48e08-b6cd-48da-ad60-56304d5f8378

📥 Commits

Reviewing files that changed from the base of the PR and between 5259e5d and d2a2ff5.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/manager.go
@pappz pappz self-requested a review April 13, 2026 07:53
@fpenezic fpenezic force-pushed the fix/stale-relay-conn-recovery branch from a9783e0 to 7d55ac2 Compare April 20, 2026 21:19

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/worker_relay.go`:
- Around line 66-75: The code currently treats relayClient.ErrConnAlreadyExists
as proof the existing relay is stale and unconditionally calls
w.relayManager.CloseConnByPeerKey; instead, check the existing connection's
health/state before closing or prevent concurrent OnNewOffer for the same peer.
Modify the branch handling relayClient.ErrConnAlreadyExists to first query the
existing connection via w.relayManager (e.g., a method like GetConnByPeerKey or
HasActiveConn) and only call w.relayManager.CloseConnByPeerKey when that
connection reports closed/stale (or fail a liveness check); otherwise reuse the
active connection or bail out/serialize the new offer (e.g., with a per-peer
lock) rather than tearing down a healthy relay, keeping the subsequent OpenConn
call only when a close was actually performed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 271f46a4-0e60-4871-8a37-51c6772eb90e

📥 Commits

Reviewing files that changed from the base of the PR and between a9783e0 and 7d55ac2.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • client/internal/peer/conn.go
  • shared/relay/client/manager.go
  • shared/relay/client/client.go

Comment thread client/internal/peer/worker_relay.go
@fpenezic fpenezic marked this pull request as draft April 22, 2026 22:02
@fpenezic

Converting to draft - I discovered a regression during extended testing.

Bug: The ErrConnAlreadyExists retry loop can fire repeatedly (observed 3600+ times over ~1 hour) in multi-peer scenarios. When remote peers send offers in rapid succession (session renewal, ICE state changes, etc. - not only after a disconnect), each offer triggers close + reopen, and the remote side sees the reconnect as another state change and sends a fresh offer, producing an infinite tear-down/rebuild loop. Each iteration allocates a new wgProxy port on 127.0.0.1 while the previous one fails with use of closed network connection, generating significant background traffic (observed 9 Mbit/s constant on the affected peer).

Root cause: my earlier reasoning that OnNewOffer is only invoked after Guard detects a disconnect was incorrect. OnNewOffer also fires on normal offer/answer exchanges, so unconditionally treating ErrConnAlreadyExists as "stale by contract" is wrong - the original CodeRabbit concern about healthy connections being torn down was valid.

Plan: Add a safeguard so the close-and-retry path only runs when the existing connection is actually stale, e.g.:

  • Cooldown: skip close if the existing conn was (re)created within the last N seconds
  • Liveness check via WGWatcher handshake state
  • Per-peer serialization of OnNewOffer so rapid back-to-back offers don't overlap

Will push a follow-up commit once I have verified the fix in the same reproduction environment (PPPoE reconnect + IP rotation).

The previous fix unconditionally closed and reopened the existing relay
conn whenever OnNewOffer hit ErrConnAlreadyExists. That caused an
infinite tear-down/rebuild loop when the remote peer sent rapid
successive offers (e.g. during reconnection): close → reopen → remote
sees teardown → sends new offer → ErrConnAlreadyExists → close → loop.

Introduce a relayConnStale atomic flag that is set only when an event
indicates the existing entry is no longer backed by a live peer
session: local WG handshake timeout, relay server close, or explicit
CloseConn. OnNewOffer only tears down and reopens when the flag is
set; otherwise it reuses the existing healthy connection.

Signal sources for the flag:
- conn.onWGDisconnected (Relay path) → MarkStale before CloseConn
- WorkerRelay.CloseConn → marks stale on close
- WorkerRelay.onRelayClientDisconnected → marks stale on server close
When WireGuard handshake times out while running over ICE,
onWGDisconnected calls workerICE.Close(). Close() sets w.agent=nil
synchronously before pion's ICE library fires the asynchronous
ConnectionStateClosed event. By the time onConnectionStateChange
runs closeAgent(), the `w.agent == agent` guard fails (w.agent is
already nil) and the session ID is not rotated.

Without session ID rotation, the next ICE offer carries the same
local session ID. The remote peer in OnNewOffer compares
remoteSessionID against the incoming offer's sessionID, finds them
equal, and skips agent recreation — reusing the existing agent and
its stale candidates from the broken network state.

Observed in production: 30s reconnect loop after a NAT/conntrack
event with the same offer session ID logged ~20+ times in a row,
WG handshake never recovering despite "ICE connection succeeded"
on every cycle.

Mirror the existing Relay-path behavior: explicitly rotate the
session ID before closing the ICE worker so the remote peer always
recreates its agent after a WG timeout on ICE.
@fpenezic fpenezic force-pushed the fix/stale-relay-conn-recovery branch from b0b9b85 to c0b3e4e Compare April 25, 2026 09:32
@fpenezic fpenezic marked this pull request as ready for review April 28, 2026 06:32

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 168-172: The code reads rt.relayClient without holding the
RelayTrack lock; fix by grabbing rt.RLock() when accessing the field, copy the
pointer to a local (e.g. rc := rt.relayClient) while holding the RLock, then
RUnlock and call rc.CloseConnByPeerKey(peerKey) only if rc != nil; this protects
RelayTrack.relayClient (used/initialized under rt.Lock() in openConnVia()) while
avoiding holding the track lock during the CloseConnByPeerKey call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b1aed8a-feb3-4803-b3cf-18c9a265d315

📥 Commits

Reviewing files that changed from the base of the PR and between b0b9b85 and c0b3e4e.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/manager.go Outdated
…eerKey

CloseConnByPeerKey was reading rt.relayClient without holding rt.RLock(),
but openConnVia() initializes that field under rt.Lock(). Race could skip
the stale-entry close while the track was still being populated, defeating
the foreign-server cleanup path.

Take rt.RLock() to copy the pointer to a local, release the track lock,
then call CloseConnByPeerKey on the copy — protects the field without
holding the track lock across a potentially blocking network call.

Addresses CodeRabbit review on PR netbirdio#5866.

Author

fpenezic commented May 5, 2026

@coderabbitai resume


coderabbitai Bot commented May 5, 2026

✅ Actions performed

Reviews resumed.
